Symbolic Execution Based Data Flow Analysis for Optimizing Compilers: Proof of Semantic Equivalence of a Program and Code Generated from the Symbolic Execution Based Data Flow Analysis
Abstract
In this technical report, we present a proof that a program is semantically equivalent to the code generated from it using symbolic execution based data flow analysis.

1. Definitions

Control Flow Graph (CFG): A directed graph whose nodes are primitive statements such as arithmetic, load, store, two-way branch, and two-way join. Its edges represent transitions from one statement to another as a result of executing the first statement.

State: A state σ (as defined in operational semantics) is a mapping from (address) numbers to (value) numbers that represents the memory.

Extended state: An extended state Σ is a pair (e, σ), where e is an edge of the control flow graph and σ is a program state.

Logical Assertion: A logical assertion (LA) is a pair (P, M), where P is a symbolic predicate expression and M is a statement, both expressed in extended first order logic. After symbolic execution, a logical assertion is associated with every edge in the control flow graph. The statement M is treated as a mapping from address expressions to value expressions written in terms of symbolic variables.

Code Generation: Define f to be a function that takes a logical assertion L and a program state σ. From the logical assertion, it generates a program P = f(L, σ). The iteration counts of loops that have not yet terminated do not exist in L. The state σ is used only to look up the values of these free loop index variables, providing iteration counts for loops that have not yet terminated.

CFG Execution: Define executeG to be a function that takes an input extended state Σ, a CFG G, and an integer k (k ≥ 0), and executes G for k instructions starting from the initial state Σ. It generates an extended output state Σ′ = executeG(Σ, G, k).

Program Execution: Define executeP to be a function that takes a state σ and a program P, and executes P until completion starting from σ. It generates an output state σ′ = executeP(σ, P).

2. Claim

Let P be a program and G be the control flow graph obtained from P. Let e0 be the entry edge of G and let σ0 be any state. Assume that starting from the extended state (e0, σ0) and executing k statements over G results in an extended state (e1, σ1), where 0 ≤ k. Let L be the logical assertion on edge e1 after performing symbolic execution based data flow analysis on the CFG G. Let P′ denote the program obtained from L. Then, (1) starting from σ0 and executing P′ results in state σ′, where σ′ = σ1; and (2) evaluating the predicate of e1 at state σ1 results in true.

Note: Let G′ be the CFG obtained from P′. The edge of G′ reached as a result of executing P′ is irrelevant, since the control flow graph G′ is optimized.

3. Assumptions

• We do not consider dead variables that are eliminated as a result of symbolic execution. Strictly speaking, the states reached are not equal but are equivalent with respect to live variables only.
• We consider only single entry, single exit loops. This proof can easily be generalized to multiple entry, multiple exit loops.
• We treat registers as special, reserved memory locations that can only be accessed through the name of the register. Therefore, a register can never be aliased with another memory location. For instance, for a register r1, M[r1] describes the contents of register r1.
• We assume that expression simplification is a valid transformation. As a result, execution of a program obtained from a logical assertion reaches the same state as the program obtained from its simplified version: executeP(σ0, f(simplify(P, E), σ)) = executeP(σ0, f(E, σ)), for any σ, σ0.

4. Notation

Let val and addr be numbers representing a value and an address, respectively. Let vexpr, aexpr, pred, expr, expr1, and expr2 be symbolic expressions.

• σ(addr): Returns the value at address addr in state σ.
• σ[val/addr]: In state σ, the contents of address addr are updated with value val.
• L(aexpr): In logical assertion L, look up the value expression for the address expression aexpr.
• L[vexpr/aexpr]: In logical assertion L, update the contents of address expression aexpr with value expression vexpr.
• simplify(pred, expr): Simplify the symbolic expression expr assuming that pred holds.
• simplify(bexpr): Simplify the symbolic Boolean expression bexpr.
• expr1 ⊕ expr2: Symbolic addition of two symbolic expressions.
• When the predicate is not relevant, logical assertions and expression maps are used interchangeably. As a result, L(aexpr) = E(aexpr) and L[vexpr/aexpr] = E[vexpr/aexpr], where L = (P, E).
• M[aexpr] == vexpr: The contents of memory address aexpr are equal to vexpr. The letter M stands for memory.
• When there are no free index variables of interest in a logical assertion L, we omit the second parameter σ of the code generation function f and write f(L).
• eval(P, σ): Evaluate the predicate P using the values for the symbolic variables from the state σ and return a Boolean value. Note that eval distributes over the logical and (∧) and or (∨) operators.

5. Extended First Order Logic and Operations on Logical Assertions

5.1 Extended First Order Logic

To simplify the representation of sequential execution, we extend first order logic with two operators: (1) a semicolon (;) operator that combines two statements; and (2) a for operator that combines multiple statements with an index variable over a range given as a parameter (e.g., for(0 ≤ j ≤ N) [L(j)]). A logical assertion can be parsed with the following grammar:

LA ::= subLA (; LA)? | if (bexpr) then LA else LA endif (; LA)? |
subLA ::= unit (∧ unit)*
unit ::= M[expr] == expr | (for | forall) (0 ≤ I ≤ IFINISH() − 1) [ LA ]

5.2 Operations on Logical Assertions

In this section, we describe the look-up and update operations on logical assertions. Let A, A′ be symbolic address expressions and v, v1, v2 be symbolic value expressions.

1. Although an LA is an extended first order statement, it can be considered as a mapping from address expressions to value expressions.

2. An LA is composed of a number of address expression to value expression mappings of the form M[aexpr] == vexpr. These mappings are joined by two types of connectors: (1) a commutative and (∧) operator, and (2) a non-commutative semicolon (;) operator. An LA can be treated as a set of sub-LAs connected by semicolons. Logical assertions can also contain if statements that multiplex two logical assertions into one.

3. When a look-up for a symbolic address expression A on a logical assertion L, written as L(A), is requested, L is assumed to be true and is searched for a sub-assertion of the form {M[A] == v} (with an exact match of the left hand side). The sub-LAs connected with semicolons are searched from right to left. (Note that the for operator is a generalization of the semicolon operator.) This look-up order ensures that when multiple sub-assertions provide a value for A, the value returned is its most recent update. If A does not exist in the logical assertion, then the following expression is returned for M[A]: for all A′ such that we cannot prove A ≠ A′, if (A = A′) then L(A′) else M[A].

Example: Consider the following logical assertion L:

L = (true, M[a] == 0; M[b] == 1)

Then, if we look up the value for address expression c, we get:

L(c) = if (b == c) then 1 else if (a == c) then 0 else M[c].

4. Alias analysis guarantees the following: for a sub-assertion L = {M[A1] == v1 ∧ M[A2] == v2}, and for some σ, if (Lσ(A1) = Lσ(A2)) then σ(v1) = σ(v2), where Lσ denotes the replacement of all symbolic variables in L with their values in σ. This means that, in the same state σ, if two address expressions evaluate to the same concrete address, then their value expressions must evaluate to equal concrete values.

5. When an update L[v/A] on a logical assertion is requested, first a new assertion of the form {L; M[A] == v} is created. Then, we use dependence analysis to check whether the semicolon operator can validly be converted into an and operator. If it can, then any existing M[A] == v0 mapping in the last sub-assertion of L is removed, and the value expressions for all addresses that might be aliased with A are updated with conditional expressions.

Example: Consider the following logical assertion over the symbolic variables a, b, and x:

L = {{M[a] == 2 ∗ x ∧ M[b] == 2 ∗ x + 1}; {M[a] == 3 ∗ x}}

In this example, for illustrative purposes, assume that (1) the semicolon operator connecting the initial two sub-LAs cannot be converted into an and operator; and (2) the variables a and b are not aliased. Then,

L(a) = 3 ∗ x
L(b) = 2 ∗ x + 1
L[5/a] = {M[a] == 2 ∗ x ∧ M[b] == 2 ∗ x + 1; M[a] == 5}
L[7/b] = {M[a] == 2 ∗ x ∧ M[b] == 2 ∗ x + 1; M[a] == 3 ∗ x; M[b] == 7} = {{M[a] == 2 ∗ x ∧ M[b] == 2 ∗ x + 1}; {M[a] == 3 ∗ x ∧ M[b] == 7}}

6. Similarly, in special cases indicated by dependence analysis, a sequential for quantifier in a logical assertion can be converted into a standard universal quantifier (forall).

6. Property

Let L1 and L2 be two logical assertions and σ0 be a state. Then,

executeP(σ0, f(L1; L2)) = executeP(σ0, f(L1); f(L2)) = executeP(executeP(σ0, f(L1)), f(L2)).

This property states that if a logical assertion L can be separated into two sub-assertions L1 and L2 connected with a semicolon, then executing the program obtained from L is semantically equivalent to executing the program obtained from L1 and then continuing with the program obtained from L2. Note, however, that f(L1; L2) ≠ f(L1); f(L2), since the program on the left hand side is an optimized version of the program on the right hand side. These programs are only semantically equivalent, written as f(L1; L2) ≡ f(L1); f(L2).
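The look-up and update operations of Section 5.2 and the sequencing property can be illustrated with a small sketch. This is not the report's implementation: the list-of-dicts representation, the (coef, const) encoding of value expressions as coef ∗ x + const, and the names lookup, update, generate, and execute_p are all assumptions made for illustration. Addresses with different names are assumed never to alias, as in the example with a and b.

```python
# Hypothetical model: a logical assertion is a list of sub-assertions joined
# by ';'. Each sub-assertion is a dict {address: value_expr} joined by 'and'.
# A value expression (coef, const) stands for coef * x + const, where x is
# the only symbolic variable. Distinct address names are assumed not to alias.

def lookup(la, addr):
    """L(addr): search the sub-assertions right to left for the latest value."""
    for sub in reversed(la):
        if addr in sub:
            return sub[addr]
    return None  # the report instead returns a conditional over M[addr]

def update(la, addr, vexpr):
    """L[vexpr/addr]: append {M[addr] == vexpr} as a new sub-assertion."""
    return la + [{addr: vexpr}]

def generate(la):
    """Stand-in for f(L): emit one store per address, using its latest value."""
    addrs = []
    for sub in la:
        for addr in sub:
            if addr not in addrs:
                addrs.append(addr)
    return [(addr, lookup(la, addr)) for addr in addrs]

def execute_p(state, program, x):
    """Stand-in for executeP: run the stores in order on a copy of the state."""
    new_state = dict(state)
    for addr, (coef, const) in program:
        new_state[addr] = coef * x + const
    return new_state

# L = {{M[a] == 2*x and M[b] == 2*x + 1}; {M[a] == 3*x}}
L1 = [{"a": (2, 0), "b": (2, 1)}]
L2 = [{"a": (3, 0)}]
L = L1 + L2

optimized = generate(L)                    # program for f(L1; L2)
sequential = generate(L1) + generate(L2)   # program for f(L1); f(L2)

sigma0 = {"a": 0, "b": 0, "c": 7}
final_optimized = execute_p(sigma0, optimized, x=4)
final_sequential = execute_p(execute_p(sigma0, generate(L1), x=4), generate(L2), x=4)
```

In this toy model the two programs differ as lists of stores (the sequential one writes a twice), yet both leave the state with the same contents, which mirrors the distinction between f(L1; L2) ≠ f(L1); f(L2) and f(L1; L2) ≡ f(L1); f(L2).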
A special case of this property can be written for the update operation:

executeP(σ0, f(L[v/a])) = executeP(executeP(σ0, f(L)), f(M[a] == v)).

This property is used extensively throughout the inductive step of the proof.

7. Modifications on the Control Flow Graph Before Processing

Before symbolic execution, the loop hierarchy of the control flow graph is identified and some modifications are made to the CFG nodes corresponding to loops. Let L represent a loop in the program. Then the following modifications are performed on L:

1. A unique index variable I is defined and assigned to L. This variable is inactive before the loop is entered.
2. A new CFG node called the "loop entry node" is created on the forward entry edge of L. It contains the statement I = 0.
3. A new CFG node called the "loop back edge node" is created on the loop back edge of L. It contains the statement I = I + 1.
4. A new CFG node called the "loop exit node" is created on the loop exit edge of L. It contains the statement I = −1.
5. A function IFINISH() that represents the number of iterations this loop executes is defined. Typically, this function is statically uncomputable. It is a function of the outer loop index variables and of other variables used inside the loop that affect the number of iterations.

These new CFG nodes are simple, single entry, single exit nodes. Further, two already existing nodes of the loop are given special names: (1) the Loop Exit Branch Node is the only node inside the loop with one of its outgoing edges exiting the loop; and (2) the Loop Entry Join Node is the only node that has an incoming edge from a node outside the loop. Note that any single entry, single exit loop can be transformed into this structure. A loop can be treated as a set of nodes combined to form a virtual super-node called a loop box, as shown in Figure 1.
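The loop modifications above can be sketched on a concrete loop. The following is only an illustration under stated assumptions: the function name run_instrumented_loop, the halving loop body, the use of −1 to model the "inactive" index variable, and reading the iteration count off I at the exit branch are all choices made here, not details taken from the report.

```python
def run_instrumented_loop(n):
    """Run `while n > 1: n //= 2` with the three inserted nodes made explicit."""
    I = -1            # assumption: -1 models the inactive index variable
    seen = []         # values of I observed inside the loop body

    I = 0             # loop entry node: I = 0
    while True:
        # loop entry join node: control arrives from the entry or the back edge
        if not (n > 1):       # loop exit branch node
            break
        seen.append(I)
        n //= 2               # original loop body
        I = I + 1             # loop back edge node: I = I + 1

    iterations = I    # in this sketch, the value IFINISH() would denote
    I = -1            # loop exit node: I = -1
    return n, seen, iterations, I

result = run_instrumented_loop(8)
```

For an input of 8 the body runs three times, so I takes the values 0, 1, 2 inside the loop, which agrees with the range 0 ≤ I ≤ IFINISH() − 1 used by the for operator in the grammar of Section 5.1.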